library("tidyverse")   # read_csv(), dplyr verbs, ggplot2
library("plotly")      # ggplotly()
library("yardstick")   # rmse(), mae(), metric_set()
library("lubridate")   # year()
Attaching package: ‘lubridate’
The following object is masked from ‘package:base’:
date
dfw <- read_csv("WalmartSales.csv")
Parsed with column specification:
cols(
  Store = col_double(),
  Date = col_date(format = ""),
  IsHoliday = col_logical(),
  Temperature = col_double(),
  Fuel_Price = col_double(),
  CPI = col_double(),
  Unemployment = col_double(),
  Size = col_double(),
  Weekly_Sales = col_double()
)
dfw
summary(dfw)
Store Date IsHoliday Temperature Fuel_Price CPI Unemployment
Min. : 1 Min. :2010-02-05 Mode :logical Min. : -2.06 Min. :2.472 Min. :126.1 Min. : 3.879
1st Qu.:12 1st Qu.:2010-10-08 FALSE:5985 1st Qu.: 47.46 1st Qu.:2.933 1st Qu.:131.7 1st Qu.: 6.891
Median :23 Median :2011-06-17 TRUE :450 Median : 62.67 Median :3.445 Median :182.6 Median : 7.874
Mean :23 Mean :2011-06-17 Mean : 60.66 Mean :3.359 Mean :171.6 Mean : 7.999
3rd Qu.:34 3rd Qu.:2012-02-24 3rd Qu.: 74.94 3rd Qu.:3.735 3rd Qu.:212.7 3rd Qu.: 8.622
Max. :45 Max. :2012-10-26 Max. :100.14 Max. :4.468 Max. :227.2 Max. :14.313
Size Weekly_Sales
Min. : 34875 Min. : 68982
1st Qu.: 70713 1st Qu.: 375614
Median :126512 Median : 639652
Mean :130288 Mean : 701560
3rd Qu.:202307 3rd Qu.: 958807
Max. :219622 Max. :2773216
fitCPI <- lm(formula = Weekly_Sales ~ CPI, data = dfw)
summary(fitCPI)
Call:
lm(formula = Weekly_Sales ~ CPI, data = dfw)
Residuals:
Min 1Q Median 3Q Max
-662386 -318443 -73868 258442 2095880
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 827280.5 21778.4 37.986 < 2e-16 ***
CPI -732.7 123.7 -5.923 3.33e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 390600 on 6433 degrees of freedom
Multiple R-squared: 0.005423, Adjusted R-squared: 0.005269
F-statistic: 35.08 on 1 and 6433 DF, p-value: 3.332e-09
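One way to read the fitCPI output above: the slope says a one-unit rise in CPI is associated with roughly $733 lower weekly sales. A minimal sketch, with the intercept and slope copied from the summary (predicted_sales is a hypothetical helper introduced here):

```r
# Sketch: point predictions from fitCPI's printed coefficients.
b0 <- 827280.5   # intercept from the summary above
b1 <- -732.7     # CPI slope from the summary above
predicted_sales <- function(cpi) b0 + b1 * cpi   # hypothetical helper
predicted_sales(c(130, 180, 220))
```

The same numbers come from `predict(fitCPI, newdata = ...)`; writing it out makes the linear form explicit.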
plotStore10 <- dfw %>% filter(Store == 10) %>%
ggplot(mapping = aes(x = CPI, y = Weekly_Sales)) +
geom_point() +
geom_smooth(method = lm)
ggplotly(plotStore10)
plotStore10
plotStore11 <- dfw %>% filter(Store == 11 ) %>%
ggplot(mapping = aes(x = CPI, y = Weekly_Sales)) +
geom_point() +
geom_smooth(method = lm)
ggplotly(plotStore11)
plotStore11
plotStore12 <- dfw %>% filter(Store == 12 ) %>%
ggplot(mapping = aes(x = CPI, y = Weekly_Sales)) +
geom_point() +
geom_smooth(method = lm)
ggplotly(plotStore12)
plotStore12
plotStore13 <- dfw %>% filter(Store == 13 ) %>%
ggplot(mapping = aes(x = CPI, y = Weekly_Sales)) +
geom_point() +
geom_smooth(method = lm)
ggplotly(plotStore13)
plotStore13
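The four nearly identical chunks above can be collapsed into one faceted plot. A sketch, assuming dfw is loaded as above (facet_wrap comes with ggplot2):

```r
# Sketch: Stores 10-13 in a single plot via facets instead of four chunks.
dfw %>%
  filter(Store %in% 10:13) %>%
  ggplot(aes(x = CPI, y = Weekly_Sales)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ Store, scales = "free_x")   # free x: CPI ranges differ by store
```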
#3. Now, filter for the year 2012 instead of a store (so, you’ll plot data from all stores in a year). For this, you will need to (install and) load the lubridate library. Check the cheat sheet for lubridate here. [ Start by copying/pasting your code from Q2 into a new chunk and reuse ]
#What do you observe? Why do you think there are almost vertical clusters of observations?
plotYear <- dfw %>% filter(year(Date)== 2012) %>%
ggplot(mapping = aes(x = CPI, y = Weekly_Sales)) +
geom_point() +
geom_smooth(method = lm)
plotYear
#4. Now, create a plot of sales in Store 1 in the year 2010. Did you know that you can use multiple arguments in one filter function as follows: filter(argument_1, argument_2,…)?
#Compared to the earlier plots, do you notice a difference in the range of CPI? Why is it so?
dfw %>% filter(year(Date)== 2010 & Store== 1) %>%
ggplot(mapping = aes(x = CPI, y = Weekly_Sales)) +
geom_point() +
geom_smooth(method = lm)
#5. Build another regression model but this time include both CPI and Size as independent variables and call it fitCPISize. Compare this model with the model you built in Q1.
#Which model is better at explaining Weekly Sales? Why? Hint: Use anova() as well.
fitCPISize <- lm(formula = Weekly_Sales ~ CPI + Size, data = dfw)
anova(fitCPISize)
Analysis of Variance Table
Response: Weekly_Sales
Df Sum Sq Mean Sq F value Pr(>F)
CPI 1 5.3507e+12 5.3507e+12 90.748 < 2.2e-16 ***
Size 1 6.0204e+14 6.0204e+14 10210.637 < 2.2e-16 ***
Residuals 6432 3.7924e+14 5.8962e+10
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary(fitCPISize)
Call:
lm(formula = Weekly_Sales ~ CPI + Size, data = dfw)
Residuals:
Min 1Q Median 3Q Max
-563750 -167145 -29612 112172 1912650
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.828e+05 1.497e+04 12.216 <2e-16 ***
CPI -6.570e+02 7.692e+01 -8.542 <2e-16 ***
Size 4.847e+00 4.796e-02 101.048 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 242800 on 6432 degrees of freedom
Multiple R-squared: 0.6156, Adjusted R-squared: 0.6155
F-statistic: 5151 on 2 and 6432 DF, p-value: < 2.2e-16
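The anova() hint can also be used for a direct nested-model comparison, which tests whether adding Size significantly improves on the Q1 model (a sketch assuming fitCPI and fitCPISize exist as fitted above):

```r
# Sketch: F-test comparing the nested models from Q1 and Q5.
anova(fitCPI, fitCPISize)
```

A significant F here says the drop in residual sum of squares from adding Size is larger than chance would explain, consistent with the jump in R-squared from 0.005 to 0.616.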
fitFull <- lm(formula = Weekly_Sales ~ CPI + Size + IsHoliday + Temperature + Fuel_Price + Unemployment, data = dfw)
summary(fitFull)
Call:
lm(formula = Weekly_Sales ~ CPI + Size + IsHoliday + Temperature +
Fuel_Price + Unemployment, data = dfw)
Residuals:
Min 1Q Median 3Q Max
-557148 -165608 -24125 112851 1918479
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.133e+05 3.546e+04 8.834 < 2e-16 ***
CPI -9.461e+02 8.445e+01 -11.203 < 2e-16 ***
Size 4.840e+00 4.802e-02 100.786 < 2e-16 ***
IsHolidayTRUE 6.012e+04 1.196e+04 5.026 5.14e-07 ***
Temperature 1.002e+03 1.739e+02 5.761 8.72e-09 ***
Fuel_Price -1.333e+04 6.822e+03 -1.954 0.0507 .
Unemployment -1.252e+04 1.725e+03 -7.258 4.40e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 241200 on 6428 degrees of freedom
Multiple R-squared: 0.621, Adjusted R-squared: 0.6206
F-statistic: 1755 on 6 and 6428 DF, p-value: < 2.2e-16
anova(fitFull)
Analysis of Variance Table
Response: Weekly_Sales
Df Sum Sq Mean Sq F value Pr(>F)
CPI 1 5.3507e+12 5.3507e+12 91.9780 < 2.2e-16 ***
Size 1 6.0204e+14 6.0204e+14 10348.9922 < 2.2e-16 ***
IsHoliday 1 1.0402e+12 1.0402e+12 17.8814 2.384e-05 ***
Temperature 1 1.1309e+12 1.1309e+12 19.4403 1.055e-05 ***
Fuel_Price 1 6.7442e+10 6.7442e+10 1.1593 0.2816
Unemployment 1 3.0642e+12 3.0642e+12 52.6735 4.403e-13 ***
Residuals 6428 3.7394e+14 5.8173e+10
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#8. The output of Q7 shows that temperature is positively associated with weekly sales. However, is that relationship really linear? Test it out by adding a squared transformation of temperature into the model using the following I(Temperature^2) and call it fitFullTemp
#What is the coefficient of the squared term? Is it statistically significant? What does it mean? Based on this, what would you do differently if you were managing Walmart’s promotions?
fitFullTemp <- lm(formula = Weekly_Sales ~ CPI + Size + IsHoliday + Fuel_Price + Temperature + Unemployment + I(Temperature^2), data = dfw)
summary(fitFullTemp)
Call:
lm(formula = Weekly_Sales ~ CPI + Size + IsHoliday + Fuel_Price +
Temperature + Unemployment + I(Temperature^2), data = dfw)
Residuals:
Min 1Q Median 3Q Max
-561455 -165260 -24674 112058 1911166
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.610e+05 4.111e+04 6.350 2.30e-10 ***
CPI -9.547e+02 8.449e+01 -11.300 < 2e-16 ***
Size 4.831e+00 4.811e-02 100.420 < 2e-16 ***
IsHolidayTRUE 6.230e+04 1.199e+04 5.197 2.09e-07 ***
Fuel_Price -1.471e+04 6.841e+03 -2.151 0.0315 *
Temperature 3.294e+03 9.301e+02 3.542 0.0004 ***
Unemployment -1.253e+04 1.724e+03 -7.268 4.09e-13 ***
I(Temperature^2) -1.982e+01 7.901e+00 -2.509 0.0121 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 241100 on 6427 degrees of freedom
Multiple R-squared: 0.6214, Adjusted R-squared: 0.621
F-statistic: 1507 on 7 and 6427 DF, p-value: < 2.2e-16
dfw %>% ggplot(aes(x= Temperature, y= Weekly_Sales)) +
geom_smooth(method= lm, formula = y ~ x + I(x^2))
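One way to read the quadratic: the fitted curve peaks where its derivative is zero, at Temperature = -b/(2c). With the coefficients printed above (about 3294 and -19.82) that is roughly 83°F. A sketch assuming fitFullTemp exists as fitted above:

```r
# Sketch: temperature at which predicted weekly sales peak under fitFullTemp.
b <- coef(fitFullTemp)
peak_temp <- -b["Temperature"] / (2 * b["I(Temperature^2)"])
peak_temp
```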
# dplyr::setdiff
# detach("package:lubridate", unload = TRUE)  # unload a package when it masks functions we need
set.seed(333)
dfwTrain <- dfw %>% sample_frac(0.8)
dfwTest <- dplyr::setdiff(dfw, dfwTrain)
fitOrg <- lm(formula = Weekly_Sales ~ CPI + Size + IsHoliday + Fuel_Price + Temperature + Unemployment + I(Temperature^2), data = dfwTrain)
fitOrg
Call:
lm(formula = Weekly_Sales ~ CPI + Size + IsHoliday + Fuel_Price +
Temperature + Unemployment + I(Temperature^2), data = dfwTrain)
Coefficients:
(Intercept) CPI Size IsHolidayTRUE Fuel_Price Temperature
283095.679 -994.622 4.824 57721.650 -11839.300 2614.925
Unemployment I(Temperature^2)
-12726.699 -14.466
resultsOrg <- dfwTest %>%
  mutate(predictedSales = predict(fitOrg, dfwTest))
rmse(resultsOrg, truth = Weekly_Sales, estimate = predictedSales)
mae(resultsOrg, truth = Weekly_Sales, estimate = predictedSales)
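To see what rmse() and mae() compute, the same quantities can be derived by hand (a sketch assuming resultsOrg holds the hold-out predictions in a predictedSales column, as above):

```r
# Sketch: RMSE and MAE computed directly from the residuals.
resultsOrg %>%
  summarise(
    rmse_manual = sqrt(mean((Weekly_Sales - predictedSales)^2)),
    mae_manual  = mean(abs(Weekly_Sales - predictedSales))
  )
```

RMSE penalizes large misses more heavily than MAE, which is why the two can rank models differently.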
#fitFull <- lm(formula = Weekly_Sales ~ CPI + Size + IsHoliday + Temperature + Fuel_Price + Unemployment, data = dfw)
#?rmse
performance <- metric_set(rmse, mae)
performance
function (data, truth, estimate, na_rm = TRUE, ...)
{
call_args <- quos(data = data, truth = !!enquo(truth), estimate = !!enquo(estimate),
na_rm = na_rm, ... = ...)
calls <- lapply(fns, call2, !!!call_args)
metric_list <- mapply(FUN = eval_safely, calls, names(calls),
SIMPLIFY = FALSE, USE.NAMES = FALSE)
bind_rows(metric_list)
}
<bytecode: 0x7fc479be4138>
<environment: 0x7fc479be18e8>
attr(,"class")
[1] "numeric_metric_set" "metric_set" "function"
attr(,"metrics")
attr(,"metrics")$rmse
function (data, ...)
{
UseMethod("rmse")
}
<bytecode: 0x7fc47ef0db28>
<environment: namespace:yardstick>
attr(,"class")
[1] "numeric_metric" "function"
attr(,"direction")
[1] "minimize"
attr(,"metrics")$mae
function (data, ...)
{
UseMethod("mae")
}
<bytecode: 0x7fc47eeb45f0>
<environment: namespace:yardstick>
attr(,"class")
[1] "numeric_metric" "function"
attr(,"direction")
[1] "minimize"
performance(data= resultsOrg, truth= Weekly_Sales, estimate= predictedSales)
fitOrgDate <- lm(formula = Weekly_Sales ~ CPI + Size + IsHoliday + Temperature + Date+ Fuel_Price + Unemployment, data = dfw)
fitOrgDate
Call:
lm(formula = Weekly_Sales ~ CPI + Size + IsHoliday + Temperature +
Date + Fuel_Price + Unemployment, data = dfw)
Coefficients:
(Intercept) CPI Size IsHolidayTRUE Temperature Date Fuel_Price
25029.055 -970.072 4.842 58776.597 992.061 21.388 -23991.947
Unemployment
-11924.759
summary(fitOrgDate)
Call:
lm(formula = Weekly_Sales ~ CPI + Size + IsHoliday + Temperature +
Date + Fuel_Price + Unemployment, data = dfw)
Residuals:
Min 1Q Median 3Q Max
-553184 -166171 -24051 113175 1919129
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.503e+04 2.469e+05 0.101 0.9192
CPI -9.701e+02 8.686e+01 -11.168 < 2e-16 ***
Size 4.842e+00 4.804e-02 100.773 < 2e-16 ***
IsHolidayTRUE 5.878e+04 1.202e+04 4.892 1.02e-06 ***
Temperature 9.921e+02 1.740e+02 5.700 1.25e-08 ***
Date 2.139e+01 1.813e+01 1.180 0.2381
Fuel_Price -2.399e+04 1.132e+04 -2.119 0.0341 *
Unemployment -1.192e+04 1.796e+03 -6.639 3.42e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 241200 on 6427 degrees of freedom
Multiple R-squared: 0.6211, Adjusted R-squared: 0.6207
F-statistic: 1505 on 7 and 6427 DF, p-value: < 2.2e-16
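Since lm() coerces Date columns to their underlying numeric day count, the Date coefficient (about 21.4, not significant here) is dollars of weekly sales per calendar day. A sketch of the implied annual trend, assuming fitOrgDate exists as above:

```r
# Sketch: approximate yearly trend implied by the per-day Date coefficient.
coef(fitOrgDate)["Date"] * 365
```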
resultsOrgDate <- dfwTest %>%
  mutate(predictedSales = predict(fitOrgDate, dfwTest))
performance(data = resultsOrgDate, truth = Weekly_Sales, estimate = predictedSales)
#10. The finale has to be sweet, right? Instead of using sales, create a log-transformed version, set the seed, split the data, run the model fitLog, make predictions, calculate performance.
#Have the coefficient estimates and variance explained in DV improved? Compare the model output and performance of fitLog with that of fitOrg from Q9c, and discuss.
#Check and compare the diagnostics from fitLog with those from fitOrg, and discuss.
set.seed(333)
dfwTrainLog <- dfw %>% sample_frac(0.8)
dfwTestLog <- dplyr::setdiff(dfw, dfwTrainLog)
# Note: with the same seed and split fraction, dfwTrainLog is identical to dfwTrain from Q9, so either works here.
fitLog <- lm(formula = log(Weekly_Sales) ~ CPI + Size + IsHoliday + Temperature + Fuel_Price + Unemployment, data = dfwTrain)
fitLog
Call:
lm(formula = log(Weekly_Sales) ~ CPI + Size + IsHoliday + Temperature +
Fuel_Price + Unemployment, data = dfwTrain)
Coefficients:
(Intercept) CPI Size IsHolidayTRUE Temperature Fuel_Price Unemployment
1.248e+01 -1.261e-03 8.087e-06 6.265e-02 2.839e-04 3.395e-03 -7.354e-03
summary(fitLog)
Call:
lm(formula = log(Weekly_Sales) ~ CPI + Size + IsHoliday + Temperature +
Fuel_Price + Unemployment, data = dfwTrain)
Residuals:
Min 1Q Median 3Q Max
-1.48085 -0.22739 -0.02379 0.22753 1.47557
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.248e+01 5.497e-02 227.067 < 2e-16 ***
CPI -1.261e-03 1.301e-04 -9.691 < 2e-16 ***
Size 8.087e-06 7.407e-08 109.172 < 2e-16 ***
IsHolidayTRUE 6.265e-02 1.843e-02 3.399 0.000681 ***
Temperature 2.839e-04 2.688e-04 1.056 0.290878
Fuel_Price 3.395e-03 1.055e-02 0.322 0.747646
Unemployment -7.354e-03 2.677e-03 -2.747 0.006040 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3339 on 5141 degrees of freedom
Multiple R-squared: 0.7054, Adjusted R-squared: 0.7051
F-statistic: 2052 on 6 and 5141 DF, p-value: < 2.2e-16
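Note that predict(fitLog, ...) returns log-sales, so to compare performance with fitOrg in dollars the predictions should be exponentiated first. A sketch (resultsLog is a name introduced here, and the naive exp() back-transform ignores retransformation bias):

```r
# Sketch: hold-out predictions in dollars from the log model.
resultsLog <- dfwTestLog %>%
  mutate(predictedSales = exp(predict(fitLog, dfwTestLog)))
performance(data = resultsLog, truth = Weekly_Sales, estimate = predictedSales)
```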
#Bonus question: Instead of predicting sales, you may also want to create a new dependent variable by dividing Weekly Sales by store Size ("Sales per square foot" - this makes sense if you focus on the utilization of store space, for example). Call it fitSalesSqFoot.
#For this exercise, as in Q10, create the variable, set the seed, split the data, make predictions, and calculate performance. What do you think is going on here? Discuss.
#In addition, in this model, you may want to try removing the variable Size, because your DV is a function of it now. Explore the differences.
dfw1 <- dfw %>% mutate(SalesSqFoot = Weekly_Sales/Size)
dfw1
set.seed(333)
dfwTrain1 <- dfw1 %>% sample_frac(0.8)
dfwTest1 <- dplyr::setdiff(dfw1, dfwTrain1)
fitSqFoot <- lm(formula = SalesSqFoot ~ CPI + IsHoliday + Temperature + Unemployment + Fuel_Price, data = dfwTrain1)
fitSqFoot
summary(fitSqFoot)
resultsSqFoot <-dfwTest1 %>%
mutate(predictedSales = predict(fitSqFoot, dfwTest1))
resultsSqFoot
performance(data=resultsSqFoot, truth= SalesSqFoot, estimate= predictedSales)
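The bonus also asks about diagnostics; base R's plot() method for lm objects gives the standard four panels (a sketch assuming fitSqFoot exists as fitted above):

```r
# Sketch: residuals-vs-fitted, Q-Q, scale-location and leverage plots.
par(mfrow = c(2, 2))
plot(fitSqFoot)
par(mfrow = c(1, 1))   # restore the default layout
```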